TASK #1: UNDERSTAND THE PROBLEM STATEMENT AND BUSINESS CASE


TASK #2: IMPORT LIBRARIES/DATASETS AND PERFORM EXPLORATORY DATA ANALYSIS

In [58]:
!pip install xgboost
Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/63/85/15ec7550b867a745f2e4fc929c72f053feb620b2d2117ab344afbfa5a53b/xgboost-1.3.0.post0-py3-none-win_amd64.whl (95.2MB)
Requirement already satisfied: numpy in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from xgboost) (1.18.2)
Requirement already satisfied: scipy in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from xgboost) (1.4.1)
Installing collected packages: xgboost
Successfully installed xgboost-1.3.0.post0
tensorboard 2.1.1 has requirement setuptools>=41.0.0, but you'll have setuptools 39.0.1 which is incompatible.
google-auth 1.12.0 has requirement setuptools>=40.3.0, but you'll have setuptools 39.0.1 which is incompatible.
You are using pip version 10.0.1, however version 20.3.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
In [28]:
!pip install wordcloud
Collecting wordcloud
  Downloading https://files.pythonhosted.org/packages/a7/f0/f7384c323c1fc7149573455f9633ef063c7b4d85c64d419b711bbca9ed29/wordcloud-1.8.1-cp37-cp37m-win_amd64.whl (154kB)
Requirement already satisfied: matplotlib in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from wordcloud) (3.2.1)
Requirement already satisfied: numpy>=1.6.1 in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from wordcloud) (1.18.2)
Collecting pillow (from wordcloud)
  Downloading https://files.pythonhosted.org/packages/2b/65/e4a5130b4162d20ed99ff096549a04d18f050cfcdb16fe1643ac751c0181/Pillow-8.0.1-cp37-cp37m-win_amd64.whl (2.1MB)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from matplotlib->wordcloud) (1.1.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from matplotlib->wordcloud) (2.8.1)
Requirement already satisfied: cycler>=0.10 in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from matplotlib->wordcloud) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from matplotlib->wordcloud) (2.4.6)
Requirement already satisfied: setuptools in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from kiwisolver>=1.0.1->matplotlib->wordcloud) (39.0.1)
Requirement already satisfied: six>=1.5 in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from python-dateutil>=2.1->matplotlib->wordcloud) (1.14.0)
Installing collected packages: pillow, wordcloud
Successfully installed pillow-8.0.1 wordcloud-1.8.1
tensorboard 2.1.1 has requirement setuptools>=41.0.0, but you'll have setuptools 39.0.1 which is incompatible.
google-auth 1.12.0 has requirement setuptools>=40.3.0, but you'll have setuptools 39.0.1 which is incompatible.
  The script wordcloud_cli.exe is installed in 'c:\users\administrator\appdata\local\programs\python\python37\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
You are using pip version 10.0.1, however version 20.3.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
In [2]:
!pip install plotly
Collecting plotly
  Downloading https://files.pythonhosted.org/packages/c9/09/315462259ab7b60a3d4b7159233ed700733c87d889755bdc00a9fb46d692/plotly-4.14.1-py2.py3-none-any.whl (13.2MB)
Collecting retrying>=1.3.3 (from plotly)
  Downloading https://files.pythonhosted.org/packages/44/ef/beae4b4ef80902f22e3af073397f079c96969c69b2c7d52a57ea9ae61c9d/retrying-1.3.3.tar.gz
Requirement already satisfied: six in c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages (from plotly) (1.14.0)
Building wheels for collected packages: retrying
  Running setup.py bdist_wheel for retrying: started
  Running setup.py bdist_wheel for retrying: finished with status 'done'
  Stored in directory: C:\Users\Administrator\AppData\Local\pip\Cache\wheels\d7\a9\33\acc7b709e2a35caa7d4cae442f6fe6fbf2c43f80823d46460c
Successfully built retrying
Installing collected packages: retrying, plotly
Successfully installed plotly-4.14.1 retrying-1.3.3
tensorboard 2.1.1 has requirement setuptools>=41.0.0, but you'll have setuptools 39.0.1 which is incompatible.
google-auth 1.12.0 has requirement setuptools>=40.3.0, but you'll have setuptools 39.0.1 which is incompatible.
You are using pip version 10.0.1, however version 20.3.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
In [4]:
import numpy as np # Multi-dimensional array object
import pandas as pd # Data Manipulation
import seaborn as sns # Data Visualization
import matplotlib.pyplot as plt # Data Visualization
import plotly.express as px # Interactive Data Visualization
from jupyterthemes import jtplot # Jupyter Notebook Theme
jtplot.style(theme = 'monokai', context = 'notebook', ticks = True, grid = False) 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot # Offline version of the Plotly modules.
In [5]:
# Read the CSV file 
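The loading cell is left empty in this notebook and the CSV filename is not shown. A minimal sketch of the loading and inspection steps follows, assuming `pd.read_csv` is used. An in-memory sample (three rows copied from the `head()` output shown below) stands in for the real file, and the filename in the comment is hypothetical.

```python
import io
import pandas as pd

# Three rows copied from the head() output in this notebook; the real
# notebook reads a CSV file whose name is not shown in the source.
sample_csv = io.StringIO(
    "Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,"
    "Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length\n"
    'Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6.0,265,17,23,4451,106,189\n'
    'Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2.0,4.0,200,24,31,2778,101,172\n'
    'Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4.0,200,22,29,3230,105,183\n'
)

# With the real file this would be e.g. pd.read_csv("cars.csv") (hypothetical filename)
car_df = pd.read_csv(sample_csv)

print(car_df.head(10))   # first rows (at most 10)
print(car_df.tail(10))   # last rows (at most 10)
print(car_df.columns)    # feature columns
print(car_df.shape)      # (rows, columns)
```

Note that the price columns load as strings because of the "$" sign and the thousands separator, which is why they are converted later in this task.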
In [6]:
# Load the top 10 instances
Out[6]:
   Make   Model                    Type    Origin  DriveTrain  MSRP     Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura  MDX                      SUV     Asia    All         $36,945  $33,337  3.5         6.0        265         17        23           4451    106        189
1  Acura  RSX Type S 2dr           Sedan   Asia    Front       $23,820  $21,761  2.0         4.0        200         24        31           2778    101        172
2  Acura  TSX 4dr                  Sedan   Asia    Front       $26,990  $24,647  2.4         4.0        200         22        29           3230    105        183
3  Acura  TL 4dr                   Sedan   Asia    Front       $33,195  $30,299  3.2         6.0        270         20        28           3575    108        186
4  Acura  3.5 RL 4dr               Sedan   Asia    Front       $43,755  $39,014  3.5         6.0        225         18        24           3880    115        197
5  Acura  3.5 RL w/Navigation 4dr  Sedan   Asia    Front       $46,100  $41,100  3.5         6.0        225         18        24           3893    115        197
6  Acura  NSX coupe 2dr manual S   Sports  Asia    Rear        $89,765  $79,978  3.2         6.0        290         17        24           3153    100        174
7  Audi   A4 1.8T 4dr              Sedan   Europe  Front       $25,940  $23,508  1.8         4.0        170         22        31           3252    104        179
8  Audi   A41.8T convertible 2dr   Sedan   Europe  Front       $35,940  $32,506  1.8         4.0        170         23        30           3638    105        180
9  Audi   A4 3.0 4dr               Sedan   Europe  Front       $31,840  $28,846  3.0         6.0        220         20        28           3462    104        179
In [7]:
# Load the bottom 10 instances 
Out[7]:
     Make   Model                    Type    Origin  DriveTrain  MSRP     Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
418  Volvo  S60 2.5 4dr              Sedan   Europe  All         $31,745  $29,916  2.5         5.0        208         20        27           3903    107        180
419  Volvo  S60 T5 4dr               Sedan   Europe  Front       $34,845  $32,902  2.3         5.0        247         20        28           3766    107        180
420  Volvo  S60 R 4dr                Sedan   Europe  All         $37,560  $35,382  2.5         5.0        300         18        25           3571    107        181
421  Volvo  S80 2.9 4dr              Sedan   Europe  Front       $37,730  $35,542  2.9         6.0        208         20        28           3576    110        190
422  Volvo  S80 2.5T 4dr             Sedan   Europe  All         $37,885  $35,688  2.5         5.0        194         20        27           3691    110        190
423  Volvo  C70 LPT convertible 2dr  Sedan   Europe  Front       $40,565  $38,203  2.4         5.0        197         21        28           3450    105        186
424  Volvo  C70 HPT convertible 2dr  Sedan   Europe  Front       $42,565  $40,083  2.3         5.0        242         20        26           3450    105        186
425  Volvo  S80 T6 4dr               Sedan   Europe  Front       $45,210  $42,573  2.9         6.0        268         19        26           3653    110        190
426  Volvo  V40                      Wagon   Europe  Front       $26,135  $24,641  1.9         4.0        170         22        29           2822    101        180
427  Volvo  XC70                     Wagon   Europe  All         $35,145  $33,112  2.5         5.0        208         20        27           3823    109        186
In [8]:
# Display the feature columns
Out[8]:
Index(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'],
      dtype='object')
In [9]:
# Check the shape of the dataframe
Out[9]:
(428, 15)
In [10]:
# Check if any missing values are present in the dataframe
Out[10]:
Make           0
Model          0
Type           0
Origin         0
DriveTrain     0
MSRP           0
Invoice        0
EngineSize     0
Cylinders      2
Horsepower     0
MPG_City       0
MPG_Highway    0
Weight         0
Wheelbase      0
Length         0
dtype: int64
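The output above reports two missing `Cylinders` values, and the later `info()` summary lists 426 entries (index 0 to 427) instead of 428, which suggests that the empty cell In [11] drops the incomplete rows. A minimal sketch of that step on a toy frame follows, assuming `dropna` was used; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only), with two missing Cylinders entries
toy_df = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Cylinders": [6.0, np.nan, 4.0, np.nan],
    "Horsepower": [265, 200, 170, 238],
})

# Count missing values per column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
clean_df = toy_df.dropna()
print(clean_df.shape)
print(clean_df.index.tolist())  # original index labels are kept, leaving gaps
```

Because `dropna` keeps the original index labels, the cleaned frame can have fewer rows than its last index label suggests, as in the summary below.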
In [11]:
 
In [12]:
# Obtain the summary of the dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 426 entries, 0 to 427
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Make         426 non-null    object 
 1   Model        426 non-null    object 
 2   Type         426 non-null    object 
 3   Origin       426 non-null    object 
 4   DriveTrain   426 non-null    object 
 5   MSRP         426 non-null    object 
 6   Invoice      426 non-null    object 
 7   EngineSize   426 non-null    float64
 8   Cylinders    426 non-null    float64
 9   Horsepower   426 non-null    int64  
 10  MPG_City     426 non-null    int64  
 11  MPG_Highway  426 non-null    int64  
 12  Weight       426 non-null    int64  
 13  Wheelbase    426 non-null    int64  
 14  Length       426 non-null    int64  
dtypes: float64(2), int64(6), object(7)
memory usage: 53.2+ KB
In [13]:
# MSRP and Invoice are stored as strings such as "$36,945"; remove the $ sign and the comma (,) before converting them to integers

car_df["MSRP"] = car_df["MSRP"].str.replace("$", "", regex = False)
car_df["MSRP"] = car_df["MSRP"].str.replace(",", "", regex = False)
car_df["MSRP"] = car_df["MSRP"].astype(int)
In [14]:
car_df["MSRP"]
Out[14]:
0      36945
1      23820
2      26990
3      33195
4      43755
       ...  
423    40565
424    42565
425    45210
426    26135
427    35145
Name: MSRP, Length: 426, dtype: int32
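The same cleaning steps can be wrapped in a small reusable function so that any price column is handled the same way. This is a sketch only; `currency_to_int` is a hypothetical helper name that does not appear in the notebook, and the sample prices are taken from values shown elsewhere in this notebook.

```python
import pandas as pd

def currency_to_int(series: pd.Series) -> pd.Series:
    """Strip "$" and "," from price strings and convert the result to integers."""
    cleaned = (
        series.str.replace("$", "", regex=False)   # literal "$", not the regex end anchor
              .str.replace(",", "", regex=False)
    )
    return cleaned.astype(int)

prices = pd.Series(["$36,945", "$23,820", "$192,465"])
converted = currency_to_int(prices)
print(converted.tolist())
```

Passing `regex=False` makes it explicit that "$" is a literal character, since in a regular expression it would mean "end of string".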

MINI CHALLENGE #1:

  • Repeat the same procedure for the invoice column
In [ ]:
 
In [15]:
# Let's view the updated MSRP and Invoice Columns
Out[15]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945 $33,337 3.5 6.0 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front 23820 $21,761 2.0 4.0 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front 26990 $24,647 2.4 4.0 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front 33195 $30,299 3.2 6.0 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 $39,014 3.5 6.0 225 18 24 3880 115 197
In [16]:
# Display the updated summary of the dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 426 entries, 0 to 427
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Make         426 non-null    object 
 1   Model        426 non-null    object 
 2   Type         426 non-null    object 
 3   Origin       426 non-null    object 
 4   DriveTrain   426 non-null    object 
 5   MSRP         426 non-null    int32  
 6   Invoice      426 non-null    object 
 7   EngineSize   426 non-null    float64
 8   Cylinders    426 non-null    float64
 9   Horsepower   426 non-null    int64  
 10  MPG_City     426 non-null    int64  
 11  MPG_Highway  426 non-null    int64  
 12  Weight       426 non-null    int64  
 13  Wheelbase    426 non-null    int64  
 14  Length       426 non-null    int64  
dtypes: float64(2), int32(1), int64(6), object(6)
memory usage: 51.6+ KB

MINI CHALLENGE #2:

  • What is the maximum price of the used car?
  • What is the minimum price of the used car?
In [ ]:
 
In [ ]:
 

TASK #3: PERFORM DATA VISUALIZATION - PART #1

In [17]:
# scatterplots for joint relationships and histograms for univariate distributions
Out[17]:
<seaborn.axisgrid.PairGrid at 0x21631af6908>
In [18]:
# Let's view various makes of the cars
Out[18]:
array(['Acura', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
       'Chrysler', 'Dodge', 'Ford', 'GMC', 'Honda', 'Hummer', 'Hyundai',
       'Infiniti', 'Isuzu', 'Jaguar', 'Jeep', 'Kia', 'Land Rover',
       'Lexus', 'Lincoln', 'MINI', 'Mazda', 'Mercedes-Benz', 'Mercury',
       'Mitsubishi', 'Nissan', 'Oldsmobile', 'Pontiac', 'Porsche', 'Saab',
       'Saturn', 'Scion', 'Subaru', 'Suzuki', 'Toyota', 'Volkswagen',
       'Volvo'], dtype=object)
In [19]:
fig = px.histogram(car_df, x = "Make",
                  labels = {"Make":"Manufacturer"},
                  title = "MAKE OF THE CAR",
                  color_discrete_sequence = ["maroon"])
                  
fig.show()
In [20]:
# Let's view various types of the cars
car_df.Type.unique()
Out[20]:
array(['SUV', 'Sedan', 'Sports', 'Wagon', 'Truck', 'Hybrid'], dtype=object)
In [21]:
fig = px.histogram(car_df, x = "Type",
                  labels = {"Type":"Type"},
                  title = "TYPE OF THE CAR",
                  color_discrete_sequence = ["blue"])
                  
fig.show()
In [22]:
# Let's plot the location
car_df.Origin.unique()
Out[22]:
array(['Asia', 'Europe', 'USA'], dtype=object)
In [23]:
fig = px.histogram(car_df, x = "Origin",
                  labels = {"Origin":"Origin"},
                  title = "LOCATION OF THE CAR SALES",
                  color_discrete_sequence = ["brown"])
                  
fig.show()
In [24]:
# Let's view the drivetrain of the cars
car_df.DriveTrain.unique()
Out[24]:
array(['All', 'Front', 'Rear'], dtype=object)
In [25]:
fig = px.histogram(car_df, x = "DriveTrain",
                  labels = {"DriveTrain":"Drivetrain"},
                  title = "DRIVETRAIN OF THE CAR",
                  color_discrete_sequence = ["black"])
                  
fig.show()
In [26]:
# Plot the make of the car and its location
fig = px.histogram(car_df, x = "Make",
                  color = "Origin",
                  labels = {"Make":"Manufacturer"},
                  title = "MAKE OF THE CAR Vs LOCATION")
                  
fig.show()
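The counts behind a stacked histogram like the one above can also be read off as a table. A minimal sketch using `pd.crosstab` on a toy subset follows; the rows are illustrative and not the full dataset.

```python
import pandas as pd

# Toy subset (illustrative rows only)
toy_df = pd.DataFrame({
    "Make":   ["Acura", "Acura", "Audi", "Audi", "Volvo", "Ford"],
    "Origin": ["Asia", "Asia", "Europe", "Europe", "Europe", "USA"],
})

# Rows are manufacturers, columns are origins, cells are car counts
counts = pd.crosstab(toy_df["Make"], toy_df["Origin"])
print(counts)
```

The same call with a different second column gives the count table for any other pair of categorical features.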

MINI CHALLENGE #3:

  • Plot the plotly histogram of Make and Type of the car
  • Find out which manufacturer has the highest number of Sports-type cars
  • Find out which manufacturers have Hybrid models
In [ ]:
 

TASK #4: PERFORM DATA VISUALIZATION - PART #2

In [29]:
# Let's view the model of all used cars using WordCloud generator
from wordcloud import WordCloud, STOPWORDS
In [30]:
car_df
Out[30]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945 $33,337 3.5 6.0 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front 23820 $21,761 2.0 4.0 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front 26990 $24,647 2.4 4.0 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front 33195 $30,299 3.2 6.0 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 $39,014 3.5 6.0 225 18 24 3880 115 197
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 Volvo C70 LPT convertible 2dr Sedan Europe Front 40565 $38,203 2.4 5.0 197 21 28 3450 105 186
424 Volvo C70 HPT convertible 2dr Sedan Europe Front 42565 $40,083 2.3 5.0 242 20 26 3450 105 186
425 Volvo S80 T6 4dr Sedan Europe Front 45210 $42,573 2.9 6.0 268 19 26 3653 110 190
426 Volvo V40 Wagon Europe Front 26135 $24,641 1.9 4.0 170 22 29 2822 101 180
427 Volvo XC70 Wagon Europe All 35145 $33,112 2.5 5.0 208 20 27 3823 109 186

426 rows × 15 columns

In [31]:
text = car_df.Model.values
In [32]:
stopwords = set(STOPWORDS)
In [33]:
wc = WordCloud(background_color = "black", max_words = 2000, max_font_size = 100, random_state = 3, 
              stopwords = stopwords, contour_width = 3).generate(str(text))          
In [34]:
fig = plt.figure(figsize = (25, 15))
plt.imshow(wc, interpolation = "bilinear")
plt.axis("off")
plt.show()
In [35]:
# Obtain the correlation matrix
Out[35]:
             MSRP         EngineSize   Cylinders    Horsepower   MPG_City     MPG_Highway  Weight       Wheelbase    Length
MSRP          1.000000     0.573238     0.649742     0.827296    -0.475916    -0.440523     0.447987     0.151665     0.171060
EngineSize    0.573238     1.000000     0.908002     0.793250    -0.717860    -0.725901     0.808707     0.638947     0.636015
Cylinders     0.649742     0.908002     1.000000     0.810341    -0.684402    -0.676100     0.742209     0.546730     0.547783
Horsepower    0.827296     0.793250     0.810341     1.000000    -0.677034    -0.647425     0.631758     0.387561     0.382386
MPG_City     -0.475916    -0.717860    -0.684402    -0.677034     1.000000     0.940993    -0.740418    -0.508029    -0.504184
MPG_Highway  -0.440523    -0.725901    -0.676100    -0.647425     0.940993     1.000000    -0.793615    -0.525457    -0.468756
Weight        0.447987     0.808707     0.742209     0.631758    -0.740418    -0.793615     1.000000     0.760857     0.689168
Wheelbase     0.151665     0.638947     0.546730     0.387561    -0.508029    -0.525457     0.760857     1.000000     0.889838
Length        0.171060     0.636015     0.547783     0.382386    -0.504184    -0.468756     0.689168     0.889838     1.000000
In [36]:
 
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x21638863d68>
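The cells that compute and plot the correlation matrix are empty in this notebook. A minimal sketch of the computation follows, assuming the matrix comes from pandas `corr()` on the numeric columns. The sample uses the first five rows of the dataset as displayed earlier, so its values will differ from the full matrix above.

```python
import pandas as pd

# First five rows of the dataset as displayed in this notebook (subset of columns)
sample_df = pd.DataFrame({
    "Make": ["Acura"] * 5,
    "MSRP": [36945, 23820, 26990, 33195, 43755],
    "EngineSize": [3.5, 2.0, 2.4, 3.2, 3.5],
    "Horsepower": [265, 200, 200, 270, 225],
    "MPG_City": [17, 24, 22, 20, 18],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr_matrix = sample_df.select_dtypes(include="number").corr()
print(corr_matrix.round(3))

# Rank the other features by their correlation with MSRP
msrp_corr = corr_matrix["MSRP"].drop("MSRP").sort_values(ascending=False)
print(msrp_corr)

# Assumption: cell In [36] likely draws this matrix as a heatmap,
# e.g. sns.heatmap(corr_matrix, annot=True)
```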

MINI CHALLENGE #4:

  • Comment on the correlation matrix, which feature has the highest positive correlation with MSRP?
In [ ]:
 

TASK #5: PREPARE THE DATA BEFORE MODEL TRAINING

In [37]:
car_df.head()
Out[37]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945 $33,337 3.5 6.0 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front 23820 $21,761 2.0 4.0 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front 26990 $24,647 2.4 4.0 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front 33195 $30,299 3.2 6.0 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 $39,014 3.5 6.0 225 18 24 3880 115 197
In [38]:
# Perform One-Hot Encoding for "Make", "Model", "Type", "Origin", and "DriveTrain"
In [39]:
 
Out[39]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length ... Type_Sedan Type_Sports Type_Truck Type_Wagon Origin_Asia Origin_Europe Origin_USA DriveTrain_All DriveTrain_Front DriveTrain_Rear
0 36945 $33,337 3.5 6.0 265 17 23 4451 106 189 ... 0 0 0 0 1 0 0 1 0 0
1 23820 $21,761 2.0 4.0 200 24 31 2778 101 172 ... 1 0 0 0 1 0 0 0 1 0
2 26990 $24,647 2.4 4.0 200 22 29 3230 105 183 ... 1 0 0 0 1 0 0 0 1 0
3 33195 $30,299 3.2 6.0 270 20 28 3575 108 186 ... 1 0 0 0 1 0 0 0 1 0
4 43755 $39,014 3.5 6.0 225 18 24 3880 115 197 ... 1 0 0 0 1 0 0 0 1 0

5 rows × 483 columns
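The encoding cell is empty in this notebook. A minimal sketch of one-hot encoding follows, assuming `pd.get_dummies` is used, which matches the `Type_Sedan` and `DriveTrain_Front` style of column names in the output above. The three rows are copied from the first rows of the dataset.

```python
import pandas as pd

# First three rows of the dataset (subset of columns)
toy_df = pd.DataFrame({
    "Type": ["SUV", "Sedan", "Sedan"],
    "DriveTrain": ["All", "Front", "Front"],
    "MSRP": [36945, 23820, 26990],
})

# Each listed categorical column is replaced by one 0/1 column per category
encoded_df = pd.get_dummies(toy_df, columns=["Type", "DriveTrain"], dtype=int)
print(encoded_df)
```

Since `Model` is nearly unique per row, encoding it creates roughly one column per car, which likely accounts for most of the 483 columns above.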

In [40]:
# Invoice feature does not contribute to car price prediction 
In [41]:
 
Out[41]:
MSRP EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length Make_Acura ... Type_Sedan Type_Sports Type_Truck Type_Wagon Origin_Asia Origin_Europe Origin_USA DriveTrain_All DriveTrain_Front DriveTrain_Rear
0 36945 3.5 6.0 265 17 23 4451 106 189 1 ... 0 0 0 0 1 0 0 1 0 0
1 23820 2.0 4.0 200 24 31 2778 101 172 1 ... 1 0 0 0 1 0 0 0 1 0
2 26990 2.4 4.0 200 22 29 3230 105 183 1 ... 1 0 0 0 1 0 0 0 1 0
3 33195 3.2 6.0 270 20 28 3575 108 186 1 ... 1 0 0 0 1 0 0 0 1 0
4 43755 3.5 6.0 225 18 24 3880 115 197 1 ... 1 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 40565 2.4 5.0 197 21 28 3450 105 186 0 ... 1 0 0 0 0 1 0 0 1 0
424 42565 2.3 5.0 242 20 26 3450 105 186 0 ... 1 0 0 0 0 1 0 0 1 0
425 45210 2.9 6.0 268 19 26 3653 110 190 0 ... 1 0 0 0 0 1 0 0 1 0
426 26135 1.9 4.0 170 22 29 2822 101 180 0 ... 0 0 0 1 0 1 0 0 1 0
427 35145 2.5 5.0 208 20 27 3823 109 186 0 ... 0 0 0 1 0 1 0 1 0 0

426 rows × 482 columns

In [42]:
df_data.shape
Out[42]:
(426, 482)
In [43]:
# Feeding input features to X and output (MSRP) to y
X = df_data.drop("MSRP", axis = 1)
y = df_data["MSRP"]
In [44]:
X = np.array(X)
In [45]:
y = np.array(y)
In [46]:
from sklearn.model_selection import train_test_split
In [47]:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size = 0.2)
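The split above uses no fixed seed, so the scores reported later change from run to run. A minimal sketch of a reproducible split follows; the `random_state = 42` value is an assumption and is not used in the notebook, and the placeholder arrays only mimic the shape of `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's X and y (426 x 481)
X_demo = np.zeros((426, 481))
y_demo = np.arange(426)

# Fixing random_state makes the split, and therefore the scores, repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```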

MINI CHALLENGE #5:

  • Verify that the split was successful
In [ ]:
 

TASK #5: TRAIN AND EVALUATE A MULTIPLE LINEAR REGRESSION MODEL


In [48]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from math import sqrt
In [49]:
 
Out[49]:
LinearRegression()
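The training cell is empty in this notebook; only its `LinearRegression()` output is shown. A minimal sketch of the fit step follows, reusing the notebook's variable name `LinearRegression_model`. The synthetic data is made up for illustration and stands in for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y is a known linear function of 3 features plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit on the training split, then evaluate on the held-out split
LinearRegression_model = LinearRegression()
LinearRegression_model.fit(X_tr, y_tr)

r2_demo = LinearRegression_model.score(X_te, y_te)  # R2 on the test split
print(LinearRegression_model.coef_, LinearRegression_model.intercept_)
print(r2_demo)
```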
In [50]:
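# Note: for regressors, .score() returns the coefficient of determination (R2), not a classification accuracy
accuracy_LinearRegression = LinearRegression_model.score(X_test, y_test)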
accuracy_LinearRegression
Out[50]:
0.8727927469649763

TASK #6: TRAIN AND EVALUATE DECISION TREE AND RANDOM FOREST MODELS

In [51]:
# Photo Credits:
# https://creazilla.com/nodes/22202-giraffe-clipart 
# https://pixy.org/4569488/ 
# https://pixabay.com/illustrations/monkey-animal-gorilla-zoo-nature-4187960/ 
# https://creazilla.com/nodes/15581-running-tiger-clipart 


In [52]:
from sklearn.tree import DecisionTreeRegressor
Out[52]:
DecisionTreeRegressor()
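Only the import and the `DecisionTreeRegressor()` output are shown for this cell; the fit step is missing. A minimal sketch follows, reusing the notebook's variable name `DecisionTree_model`. The step-shaped toy data is made up to show that a tree predicts piecewise-constant values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: the target is 10 when the feature is below 5, otherwise 20
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.where(X_demo[:, 0] < 5, 10.0, 20.0)

# A regression tree splits the feature range and predicts the mean target per region
DecisionTree_model = DecisionTreeRegressor(random_state=0)
DecisionTree_model.fit(X_demo, y_demo)

tree_predictions = DecisionTree_model.predict([[2.0], [8.0]])
print(tree_predictions)
```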
In [53]:
accuracy_DecisionTree = DecisionTree_model.score(X_test, y_test)
accuracy_DecisionTree
Out[53]:
0.7505060433960574
In [54]:
from sklearn.ensemble import RandomForestRegressor
In [55]:
 
Out[55]:
RandomForestRegressor(max_depth=5, n_estimators=5)
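The training cell is empty here as well. A minimal sketch follows, using the two hyperparameters visible in the output above (`n_estimators = 5`, `max_depth = 5`) and the notebook's variable name `RandomForest_model`. The `random_state` and the synthetic data are assumptions added for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends linearly on the first feature, plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

# 5 shallow trees, each trained on a bootstrap sample; predictions are averaged
RandomForest_model = RandomForestRegressor(n_estimators=5, max_depth=5, random_state=0)
RandomForest_model.fit(X_demo, y_demo)

r2_train = RandomForest_model.score(X_demo, y_demo)  # R2 on the training data
print(len(RandomForest_model.estimators_), r2_train)
```

With only 5 trees of depth 5 the forest is quite small; larger values usually give more stable results at the cost of training time.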
In [56]:
accuracy_RandomForest= RandomForest_model.score(X_test, y_test)
accuracy_RandomForest
Out[56]:
0.8343625774487002

TASK #7: UNDERSTAND THE THEORY AND INTUITION BEHIND XG-BOOST ALGORITHM
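The core idea of boosting is that each new tree is trained on the errors (residuals) left by the trees before it. The sketch below shows that idea with plain scikit-learn trees. It is a simplified illustration and not XGBoost's actual algorithm, which adds regularization and other refinements; the data, `n_rounds` and `max_depth` values are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.3   # same value as the learning_rate shown in the XGBRegressor output
n_rounds = 50         # number of boosting rounds (assumption)

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y_demo - prediction                    # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                        # the new tree models the residuals
    prediction += learning_rate * tree.predict(X_demo) # take a small corrective step
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print(initial_mse, final_mse)
```

The learning rate shrinks each tree's contribution, so many weak trees add up gradually instead of one tree dominating the fit.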


TASK #8: TRAIN AN XG-BOOST REGRESSOR MODEL

In [59]:
from xgboost import XGBRegressor
In [60]:
 
Out[60]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=2, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
In [61]:
accuracy_XGBoost = model.score(X_test, y_test)
accuracy_XGBoost
Out[61]:
0.8993720891756513

MINI CHALLENGE #6:

  • Which regressor performed best?
In [ ]:
 

TASK #9: COMPARE MODELS AND CALCULATE REGRESSION KPIs

In [62]:
y_predict_linear = LinearRegression_model.predict(X_test)

fig = sns.regplot(x = y_predict_linear, y = y_test, color = 'red', marker = "^")
fig.set(title = "Linear Regression Model", xlabel = "Predicted Price of the used cars ($)", ylabel = "Actual Price of the used cars ($)")
Out[62]:
[Text(0, 0.5, 'Actual Price of the used cars ($)'),
 Text(0.5, 0, 'Predicted Price of the used cars ($)'),
 Text(0.5, 1.0, 'Linear Regression Model')]
In [63]:
RMSE = round(float(np.sqrt(mean_squared_error(y_test, y_predict_linear))), 3)
MSE = mean_squared_error(y_test, y_predict_linear)
MAE = mean_absolute_error(y_test, y_predict_linear)
r2 = r2_score(y_test, y_predict_linear)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2) 
RMSE = 6899.618 
MSE = 47604724.28139484 
MAE = 5082.029749213775 
R2 = 0.8727927469649763
In [64]:
y_predict_RandomForest = RandomForest_model.predict(X_test)

fig = sns.regplot(x = y_predict_RandomForest, y = y_test, color = 'blue', marker = "s")
fig.set(title = "Random Forest Regression Model", xlabel = "Predicted Price of the used cars ($)", ylabel= "Actual Price of the used cars ($)")
Out[64]:
[Text(0, 0.5, 'Actual Price of the used cars ($)'),
 Text(0.5, 0, 'Predicted Price of the used cars ($)'),
 Text(0.5, 1.0, 'Random Forest Regression Model')]
In [65]:
RMSE = round(float(np.sqrt(mean_squared_error(y_test, y_predict_RandomForest))), 3)
MSE = mean_squared_error(y_test, y_predict_RandomForest)
MAE = mean_absolute_error(y_test, y_predict_RandomForest)
r2 = r2_score(y_test, y_predict_RandomForest)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2) 
RMSE = 7873.146 
MSE = 61986432.715943724 
MAE = 5195.638274672215 
R2 = 0.8343625774487002
In [66]:
y_predict_XGBoost = model.predict(X_test)

fig = sns.regplot(x = y_predict_XGBoost, y = y_test, color = 'green', marker = "D")
fig.set(title = "XGBoost Model", xlabel = "Predicted Price of the used cars ($)", ylabel = "Actual Price of the used cars ($)")
Out[66]:
[Text(0, 0.5, 'Actual Price of the used cars ($)'),
 Text(0.5, 0, 'Predicted Price of the used cars ($)'),
 Text(0.5, 1.0, 'XGBoost Model')]
In [67]:
RMSE = round(float(np.sqrt(mean_squared_error(y_test, y_predict_XGBoost))), 3)
MSE = mean_squared_error(y_test, y_predict_XGBoost)
MAE = mean_absolute_error(y_test, y_predict_XGBoost)
r2 = r2_score(y_test, y_predict_XGBoost)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2) 
RMSE = 6136.607 
MSE = 37657946.66194855 
MAE = 4057.7332508175873 
R2 = 0.8993720891756513
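The same four KPIs are computed three times above with copied code. A small helper can compute them in one place; `regression_kpis` is a hypothetical name that does not appear in the notebook, and the demo arrays are made up so the results can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_kpis(y_true, y_pred):
    """Return RMSE, MSE, MAE and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "RMSE": float(np.sqrt(mse)),   # same units as the target (dollars here)
        "MSE": float(mse),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
    }

# Hand-checkable example: the errors are +2, -2, +3, -3
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 37.0])
kpis = regression_kpis(y_true_demo, y_pred_demo)
print(kpis)
```

In the notebook it would be called as, for example, `regression_kpis(y_test, y_predict_XGBoost)`.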

From the above results, the XGBoost model achieves the highest R2 score (about 0.90) along with the lowest RMSE and MAE, outperforming both the Linear Regression (R2 of about 0.87) and Random Forest (R2 of about 0.83) models.
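The KPIs printed in this task can be collected in one table to make the comparison explicit. The sketch below copies the reported values into a DataFrame and ranks the models by R2; the numbers come from this run and will change with a different train/test split.

```python
import pandas as pd

# KPI values copied from the printed outputs in this task
results = pd.DataFrame(
    {
        "RMSE": [6899.618, 7873.146, 6136.607],
        "MAE": [5082.029749213775, 5195.638274672215, 4057.7332508175873],
        "R2": [0.8727927469649763, 0.8343625774487002, 0.8993720891756513],
    },
    index=["Linear Regression", "Random Forest", "XGBoost"],
)

# Higher R2 is better; lower RMSE and MAE are better
ranked = results.sort_values("R2", ascending=False)
best_model = ranked.index[0]
print(ranked.round(3))
print("Best model by R2:", best_model)
```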

EXCELLENT JOB!

MINI CHALLENGES SOLUTIONS

MINI CHALLENGE #1 SOLUTION:

  • Repeat the same procedure for the invoice column
In [68]:
car_df["Invoice"] = car_df["Invoice"].str.replace("$", "", regex = False)
car_df["Invoice"] = car_df["Invoice"].str.replace(",", "", regex = False)
car_df["Invoice"] = car_df["Invoice"].astype(int)

MINI CHALLENGE #2 SOLUTION:

  • What is the maximum price of the used car?
  • What is the minimum price of the used car?
In [69]:
print(car_df.MSRP.max())
print(car_df.MSRP.min())
192465
10280
In [70]:
# Display the statistical details of the dataframe
car_df.describe()
Out[70]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 426.000000 426.000000 426.000000 426.000000 426.000000 426.000000 426.000000 426.000000 426.000000 426.000000
mean 32804.549296 30040.654930 3.205634 5.807512 215.877934 20.070423 26.854460 3580.474178 108.164319 186.420188
std 19472.460825 17679.430122 1.103520 1.558443 71.991040 5.248616 5.752335 759.870073 8.330030 14.366611
min 10280.000000 9875.000000 1.400000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
25% 20324.750000 18836.000000 2.400000 4.000000 165.000000 17.000000 24.000000 3111.250000 103.000000 178.000000
50% 27807.500000 25521.500000 3.000000 6.000000 210.000000 19.000000 26.000000 3476.000000 107.000000 187.000000
75% 39225.000000 35754.750000 3.900000 6.000000 255.000000 21.750000 29.000000 3979.250000 112.000000 194.000000
max 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000

MINI CHALLENGE #3 SOLUTION:

  • Plot the plotly histogram of Make and Type of the car
  • Find out which manufacturer has the highest number of Sports-type cars
  • Find out which manufacturers have Hybrid models
In [71]:
fig = px.histogram(car_df, x = "Make",
                  color = "Type",
                  labels = {"Make":"Manufacturer"},
                  title = "MAKE AND TYPE OF THE CAR",
                  opacity = 1)
                  
fig.show()

- Porsche has the highest number of Sports-type cars

- Honda and Toyota have Hybrid models

MINI CHALLENGE #4 SOLUTION:

  • Comment on the correlation matrix, which feature has the highest positive correlation with MSRP?
In [72]:
# Positive correlation between engine size and number of cylinders
# Positive correlation between horsepower and number of cylinders
# The feature with the highest positive correlation with MSRP is Horsepower (0.83)

MINI CHALLENGE #5 SOLUTION:

  • Verify that the split was successful
In [73]:
X_train.shape
Out[73]:
(340, 481)
In [74]:
X_test.shape
Out[74]:
(86, 481)

MINI CHALLENGE #6 SOLUTION:

  • Which regressor performed best?
In [ ]:
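# XGBoost performed best (R2 of about 0.90, versus 0.87 for Linear Regression, 0.83 for Random Forest and 0.75 for Decision Tree)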